NSF PAR Search | NSF Public Access Repository

k-nonical space: sketching with reverse complements

https://doi.org/10.1093/bioinformatics/btae629

Marçais, Guillaume; Elder, C. S.; Kingsford, Carl; Nikolski, Macha (ed.) (October 2024, Bioinformatics)

Abstract. Motivation: Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). Results: The effect of using the canonical representation with sketching methods is understudied and poorly understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome (“sketching deserts”) are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. Availability and implementation: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.
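
To make the "folding" step in the abstract above concrete, here is a minimal sketch of mapping each k-mer to a canonical representative. It assumes the common convention of taking the lexicographically smaller of a k-mer and its reverse complement; the abstract does not state which convention the authors use, and the function names are hypothetical, not taken from mdsscope.

```python
# Illustrative sketch only: function names are hypothetical and the
# "lexicographic minimum" convention is an assumption, not the paper's code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """Canonical form of every k-mer of seq, in left-to-right order."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

With this convention, a sequence and its reverse complement yield the same canonical k-mers in reversed order, which is the strand-independence property the folding is meant to provide.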
Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets

https://doi.org/10.1089/cmb.2024.0544

Marçais, Guillaume; DeBlasio, Dan; Kingsford, Carl (July 2024, Journal of Computational Biology)

Full Text Available
Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme

https://doi.org/10.1089/cmb.2023.0212

Hoang, Minh; Marçais, Guillaume; Kingsford, Carl (January 2024, Journal of Computational Biology)

Full Text Available
Creating and Using Minimizer Sketches in Computational Genomics

https://doi.org/10.1089/cmb.2023.0094

Zheng, Hongyu; Marçais, Guillaume; Kingsford, Carl (January 2023, Journal of Computational Biology)

Full Text Available
Sequence-specific minimizers via polar sets

https://doi.org/10.1093/bioinformatics/btab313

Zheng, Hongyu; Kingsford, Carl; Marçais, Guillaume (July 2021, Bioinformatics)

Abstract. Motivation: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. Results: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. Availability and implementation: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. Supplementary information: Supplementary data are available at Bioinformatics online.
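
As a rough illustration of the sampling described in the abstract above, the following sketch implements a plain minimizer scheme: in every window of w consecutive k-mers, the smallest k-mer is selected. The lexicographic ordering, the leftmost tie-break, and the function name are assumptions made for this example; it does not implement polar sets or the authors' code.

```python
# Illustrative sketch only: a basic (w, k) minimizer scheme with a
# lexicographic order, not the polar-set construction from the paper.
def minimizers(seq: str, k: int, w: int) -> list[int]:
    """Return the sorted start positions of the selected k-mers."""
    n_kmers = len(seq) - k + 1
    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer; ties go to the leftmost position.
        best = min(window, key=lambda i: (seq[i:i + k], i))
        selected.add(best)
    return sorted(selected)
```

Because one position is chosen per window, every window of w consecutive k-mers contains at least one selected k-mer. The number of selected positions is the sketch size that the methods in the abstract aim to reduce.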
Full Text Available
Sketching and Sublinear Data Structures in Genomics

https://doi.org/10.1146/annurev-biodatasci-072018-021156

Marçais, Guillaume; Solomon, Brad; Patro, Rob; Kingsford, Carl (July 2019, Annual Review of Biomedical Data Science)

Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizer schemes. We describe these techniques at a high level and give several representative applications of each.
Full Text Available